cache line
DCO: Dynamic Cache Orchestration for LLM Accelerators through Predictive Management
Zhou, Zhongchun, Lai, Chengtao, Gu, Yuhang, Zhang, Wei
Abstract: The rapid adoption of large language models (LLMs) is pushing AI accelerators toward increasingly powerful and specialized designs. Instead of further complicating software development with deeply hierarchical scratchpad memories (SPMs) and their asynchronous management, we investigate the opposite point of the design spectrum: a multi-core AI accelerator equipped with a shared system-level cache and application-aware management policies, which keeps the programming effort modest. Our approach exploits dataflow information available in the software stack to guide cache replacement (including dead-block prediction), in concert with bypass decisions and mechanisms that alleviate cache thrashing. We assess the proposal using a cycle-accurate simulator and observe substantial performance gains (up to 1.80x speedup) compared with conventional cache architectures. In addition, we build and validate an analytical model that accounts for the actual overlapping behaviors, so that the measured results of our policies can be extended to larger-scale real-world workloads. Experimental results show that, when working together, our bypassing and thrashing mitigation strategies handle scenarios both with and without inter-core data sharing and achieve remarkable speedups. Finally, we implement the design in RTL; its area is 0.064 mm². Our findings explore the potential of the shared cache design to assist the development of future AI accelerator systems.
A preliminary version of this paper appeared in the proceedings of ICS 2024. Z. Zhou and C. Lai contributed equally to this work. Z. Zhou and C. Lai are with the Department of Electronic and Computer Engineering, The Hong Kong University of Science and Technology, Clear Water Bay, Kowloon, Hong Kong (e-mail: zzhouch@connect.ust.hk). Y. Gu is with the School of Electronic Science and Engineering, Southeast University, Nanjing, Jiangsu, China. W. Zhang (corresponding author) is with the Department of Electronic and Computer Engineering, The Hong Kong University of Science and Technology, Clear Water Bay, Kowloon, Hong Kong (e-mail: eeweiz@ust.hk). Personal use of this material is permitted.
With the advent of the artificial intelligence (AI) era, the demand for AI-tailored hardware has surged across various environments, from data centers to embedded systems. These accelerators span a broad spectrum, from power-efficient devices to those designed for high computational throughput [34]. AI accelerators, compared with Graphics Processing Units (GPUs), can be optimized for AI applications and tailored for specific scenarios, such as pre-defined neural network (NN) computation graphs, operator types, certain data precision, and given power budgets. Since they are often used in scenarios where the execution graph is known during compilation, they typically employ software-controlled scratchpad memories (SPMs) as the on-chip storage.
A TRRIP Down Memory Lane: Temperature-Based Re-Reference Interval Prediction For Instruction Caching
Kao, Henry, Sreekumar, Nikhil, Soni, Prabhdeep Singh, Sedaghati, Ali, Su, Fang, Chan, Bryan, Goudarzi, Maziar, Azimi, Reza
Modern mobile CPU software poses challenges for conventional instruction cache replacement policies because its complex runtime behavior causes high reuse distances between executions of the same instruction. Mobile code commonly suffers from frequent stalls in the CPU frontend, which starve the rest of the CPU's resources. The complexity of these applications and their code footprint are projected to grow faster than available on-chip memory due to power and area constraints, making conventional hardware-centric methods for managing instruction caches inadequate. We present a novel software-hardware co-design approach called TRRIP (Temperature-based Re-Reference Interval Prediction) that enables the compiler to analyze, classify, and transform code based on "temperature" (hot/cold), and to provide the hardware with a summary of code temperature information through a well-defined OS interface based on code page attributes. TRRIP's lightweight hardware extension uses code temperature attributes to optimize the instruction cache replacement policy, reducing the eviction rate of hot code. TRRIP is designed to be practical and adoptable in real mobile systems that have strict feature requirements on both the software and hardware components. TRRIP can reduce the L2 MPKI for instructions by 26.5%, resulting in a geomean speedup of 3.9% on top of RRIP cache replacement when running mobile code already optimized using PGO.
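To make the idea of a temperature hint concrete, the following is a minimal sketch of one cache set managed with a 2-bit RRIP-style policy. The insertion rule (hot lines inserted with the smallest re-reference value, everything else with the usual "long" value) and the names `RripSet`, `access`, and `hot` are assumptions made for illustration; the abstract does not specify TRRIP's actual hardware rule.

```python
MAX_RRPV = 3  # 2-bit re-reference prediction value (RRPV), as in standard SRRIP


class RripSet:
    """One cache set under an RRIP-style policy with an optional 'hot' hint."""

    def __init__(self, ways):
        self.ways = ways
        self.rrpv = {}  # tag -> current RRPV

    def access(self, tag, hot=False):
        """Return True on a hit. `hot` stands in for a page-attribute hint."""
        if tag in self.rrpv:
            self.rrpv[tag] = 0  # hit promotion: predict near re-reference
            return True
        if len(self.rrpv) >= self.ways:
            # Age every line until at least one reaches MAX_RRPV, then evict it.
            while MAX_RRPV not in self.rrpv.values():
                for t in self.rrpv:
                    self.rrpv[t] += 1
            victim = next(t for t, v in self.rrpv.items() if v == MAX_RRPV)
            del self.rrpv[victim]
        # Assumed insertion rule: hot code is predicted to be reused soon,
        # all other code gets SRRIP's default "long" interval.
        self.rrpv[tag] = 0 if hot else MAX_RRPV - 1
        return False


def survives_cold_stream(hinted):
    """Insert one line, stream three cold lines past it, then re-access it."""
    s = RripSet(ways=2)
    s.access("hot_fn", hot=hinted)
    for cold in ("c1", "c2", "c3"):
        s.access(cold)
    return s.access("hot_fn", hot=hinted)
```

In this toy setup, `survives_cold_stream(True)` hits while `survives_cold_stream(False)` misses, which illustrates how an insertion hint alone can lower the eviction rate of hot code.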
Memory Optimization for Convex Hull Support Point Queries
This work has been submitted to the IEEE for possible publication. Support point queries are a critical part of many collision detection pipelines, including those for robotics and real-time graphical applications. This paper proposes several memory layout optimizations to speed up support point queries on convex hulls. These methods are implemented and tested on a variety of hardware models and reduce processing time by up to a factor of five compared to current approaches. The results in this paper can be integrated with existing physics libraries with minimal effort. Interest in real-time robotic path planning is increasing as robotic systems become more ubiquitous and flexible, and with this growth comes the need for computationally efficient real-time collision modeling.
VoxelCache: Accelerating Online Mapping in Robotics and 3D Reconstruction Tasks
Durvasula, Sankeerth, Kiguru, Raymond, Mathur, Samarth, Xu, Jenny, Lin, Jimmy, Vijaykumar, Nandita
Real-time 3D mapping is a critical component in many important applications today including robotics, AR/VR, and 3D visualization. 3D mapping involves continuously fusing depth maps obtained from depth sensors in phones, robots, and autonomous vehicles into a single 3D representative model of the scene. Many important applications, e.g., global path planning and trajectory generation in micro aerial vehicles, require the construction of large maps at high resolutions. In this work, we identify mapping, i.e., construction and updates of 3D maps to be a critical bottleneck in these applications. The memory required and access times of these maps limit the size of the environment and the resolution with which the environment can be feasibly mapped, especially in resource constrained environments such as autonomous robot platforms and portable devices. To address this challenge, we propose VoxelCache: a hardware-software technique to accelerate map data access times in 3D mapping applications. We observe that mapping applications typically access voxels in the map that are spatially co-located to each other. We leverage this temporal locality in voxel accesses to cache indices to blocks of voxels to enable quick lookup and avoid expensive access times. We evaluate VoxelCache on popularly used mapping and reconstruction applications on both GPUs and CPUs. We demonstrate an average speedup of 1.47X (up to 1.66X) and 1.79X (up to 1.91X) on CPUs and GPUs respectively.
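As a rough software analogy for caching indices to blocks of voxels, the sketch below places a small LRU cache in front of a block lookup table. The block edge length `BLOCK`, the class name `BlockIndexCache`, and the dictionary standing in for the full map structure are all assumptions for illustration; they are not taken from VoxelCache itself.

```python
from collections import OrderedDict

BLOCK = 8  # assumed number of voxels along each edge of a block


class BlockIndexCache:
    """Small LRU cache mapping block coordinates to block-storage indices."""

    def __init__(self, block_table, capacity=4):
        self.block_table = block_table  # stand-in for the slow full-map lookup
        self.capacity = capacity
        self.recent = OrderedDict()     # block coords -> index, in LRU order
        self.hits = 0
        self.misses = 0

    def lookup(self, x, y, z):
        """Return the storage index of the block containing voxel (x, y, z)."""
        key = (x // BLOCK, y // BLOCK, z // BLOCK)
        if key in self.recent:
            self.hits += 1
            self.recent.move_to_end(key)  # mark as most recently used
            return self.recent[key]
        self.misses += 1
        idx = self.block_table[key]       # slow path: consult the full map
        self.recent[key] = idx
        if len(self.recent) > self.capacity:
            self.recent.popitem(last=False)  # evict the least recently used
        return idx


table = {(0, 0, 0): 0, (1, 0, 0): 1}
cache = BlockIndexCache(table)
indices = [cache.lookup(*v) for v in [(0, 0, 0), (1, 2, 3), (7, 7, 7), (8, 0, 0)]]
```

Because the first three voxels fall in the same block, only the first of them pays for the slow lookup; neighbouring accesses are served from the small cache.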
I Am SO Glad I'm Uncoordinated!
Sometime around 1962 or 1963, I was six or seven years old. I was trotted down to the Little League field and told I should sign up. It wasn't too bad standing in the outfield until I understood they were SERIOUS that I should move into the path of the approaching ball! The odds of me lining up the mitt to the oncoming ball were vanishingly small. I lasted less than a week in Little League. So, I took up reading books and that's served me well. I am truly glad that I'm uncoordinated! In this paper, we're going to look at how computing has evolved through the years. The act of coordinating to share stuff hurts more and more over time. We'll first examine how this has changed and continues to change. Then, we'll look at how we can reduce and sometimes eliminate these challenges over time. I've seen tremendous changes since I dropped out of college in 1976. The nature and character of our designs continue to evolve as technology advances. Stuff that used to be easy is now hard. Stuff that used to be hard is now easy. The way we write data has begun an inexorable change from updating the single copy of some data to adding new versions to a log of changes. Similar to the way accountants only add journal entries, our systems are appending their intention to make changes to a log. This means the data we store is immutable. Being immutable means it is more easily managed in a distributed environment, avoiding the difficulties of read-modify-write. Also, we can bundle appended log writes into one batched append.
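Here is a small sketch of what appending versions to a log, rather than updating in place, might look like. The class `AppendOnlyLog` and its methods `stage`, `flush`, `read`, and `history` are made-up names for illustration, and the in-memory list merely stands in for durable storage.

```python
class AppendOnlyLog:
    """Toy append-only store: entries are added, never updated in place."""

    def __init__(self):
        self._entries = []  # committed (key, value) records, oldest first
        self._pending = []  # writes staged for the next batched append

    def stage(self, key, value):
        """Record the intention to write; nothing is committed yet."""
        self._pending.append((key, value))

    def flush(self):
        """Commit all staged writes as one batched append; return the count."""
        count = len(self._pending)
        self._entries.extend(self._pending)
        self._pending = []
        return count

    def read(self, key):
        """Return the newest committed version of `key`, or None."""
        for k, v in reversed(self._entries):
            if k == key:
                return v
        return None

    def history(self, key):
        """Return every committed version of `key`, oldest first."""
        return [v for k, v in self._entries if k == key]


log = AppendOnlyLog()
log.stage("balance", 100)
log.stage("balance", 80)
written = log.flush()
```

Nothing is ever overwritten: the older version of `balance` stays in the log, much like a journal entry, and both writes land in a single batched append.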
An Imitation Learning Approach for Cache Replacement
Liu, Evan Zheran, Hashemi, Milad, Swersky, Kevin, Ranganathan, Parthasarathy, Ahn, Junwhan
Program execution speed critically depends on increasing cache hits, as cache hits are orders of magnitude faster than misses. To increase cache hits, we focus on the problem of cache replacement: choosing which cache line to evict upon inserting a new line. This is challenging because it requires planning far ahead and currently there is no known practical solution. As a result, current replacement policies typically resort to heuristics designed for specific common access patterns, which fail on more diverse and complex access patterns. In contrast, we propose an imitation learning approach to automatically learn cache access patterns by leveraging Belady's, an oracle policy that computes the optimal eviction decision given the future cache accesses. While directly applying Belady's is infeasible since the future is unknown, we train a policy conditioned only on past accesses that accurately approximates Belady's even on diverse and complex access patterns, and call this approach Parrot. When evaluated on 13 of the most memory-intensive SPEC applications, Parrot increases cache hit rates by 20% over the current state of the art. In addition, on a large-scale web search benchmark, Parrot increases cache hit rates by 61% over a conventional LRU policy. We release a Gym environment to facilitate research in this area, as data is plentiful, and further advancements can have significant real-world impact.
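For readers unfamiliar with the oracle mentioned above, the following sketch shows Belady's policy on a fully associative cache next to plain LRU. It is only an illustration of the oracle that a learned policy would try to imitate, not of Parrot itself; the function names and the example trace are assumptions.

```python
from collections import OrderedDict

NEVER = float("inf")  # next-use position for lines that are not accessed again


def belady_misses(accesses, capacity):
    """Count misses when evicting the line whose next use is furthest away."""
    # Scan backwards to find, for each access, when that line is next used.
    next_use = [NEVER] * len(accesses)
    last_seen = {}
    for i in range(len(accesses) - 1, -1, -1):
        next_use[i] = last_seen.get(accesses[i], NEVER)
        last_seen[accesses[i]] = i

    cache = {}  # resident line -> position of its next use
    misses = 0
    for i, line in enumerate(accesses):
        if line not in cache:
            misses += 1
            if len(cache) >= capacity:
                victim = max(cache, key=cache.get)  # furthest future use
                del cache[victim]
        cache[line] = next_use[i]
    return misses


def lru_misses(accesses, capacity):
    """Count misses under least-recently-used replacement."""
    cache = OrderedDict()
    misses = 0
    for line in accesses:
        if line in cache:
            cache.move_to_end(line)
            continue
        misses += 1
        if len(cache) >= capacity:
            cache.popitem(last=False)  # evict the least recently used line
        cache[line] = True
    return misses


trace = [1, 2, 3, 1, 2, 4, 1, 2, 3, 4]
print(belady_misses(trace, 3), lru_misses(trace, 3))  # prints "5 6"
```

The oracle needs the whole future trace, which is why it cannot be deployed directly and serves only as a training target or an upper bound.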